Abstract

Background: The Millennium Cohort Study is a longitudinal cohort study designed in the late 1990s to evaluate 
how military service may affect long-term health. The purpose of this investigation was to examine characteristics 
of Millennium Cohort Study participants who responded to the open-ended question, and to identify and 
investigate the most commonly reported areas of concern. 

Methods: Participants who responded during the 2001-2003 and 2004-2006 questionnaire cycles were included in 
this study (n = 108,129). To perform these analyses, Latent Semantic Analysis (LSA) was applied to a broad open- 
ended question asking the participant if there were any additional health concerns. Multivariable logistic regression 
was performed to examine the adjusted odds of responding to the open-text field, and cluster analysis was 
executed to understand the major areas of concern for participants providing open-ended responses. 

Results: Participants who provided information in the open-ended text field (n = 27,916) had significantly lower
self-reported general health compared with those who did not provide information in the open-ended text field. 
The bulk of responses concerned a finite number of topics, most notably illness/injury, exposure, and exercise. 

Conclusion: These findings suggest generalized topic areas of concern and identify subgroups that are more likely to
provide additional information in their responses, which may add insight to future epidemiologic and military research.



Background 

Qualitative data can provide epidemiologists with invalu- 
able information that cannot be captured by quantitative 
data alone. Open-ended survey responses are difficult to 
analyze quantitatively in a large-scale study due to time 
constraints and complexity of categorizing the responses 
in a consistent and unbiased way. Latent Semantic Analy- 
sis (LSA) provides a method for open-ended text analysis 
using sophisticated statistical and mathematical algo- 
rithms [1]. This method reveals subtle textual meaning 
using an automated approach that eliminates potential 
human bias and permits rapid coding of large amounts of 
data [2]. LSA is widely used in applications of informa- 
tion retrieval [1], spam filtering [3], and automated essay 
scoring [4]. To date, modest assessments of LSA's functionality
for open-ended text responses have shown promising
results [5], opening the way for large-scale
application of this technique in areas such as epidemiologic
survey research.

This investigation explores the use of LSA to analyze 
open-ended responses from Millennium Cohort Study 
participants collected from 2001-2006 to investigate 
important health concerns that may not be covered by 
the structured questionnaire. Participant responses may 
also add value to existing research by providing more 
insight into emerging areas of concern. Additionally, they
may prompt suggestions for refining future versions of
the questionnaire by including previously omitted topics.
The use of LSA for efficient and standardized analysis of
open-ended responses from large-scale studies such as
the Millennium Cohort will further epidemiological
research by allowing researchers to gain deeper insight into
the populations under study.

Methods 

Population and data sources 

This cross-sectional investigation is part of the larger 
Millennium Cohort Study, which was designed in the late 
1990s to determine how military service may affect long-term
health [6]. Those invited to participate in Panel 1 of
the Millennium Cohort Study were randomly selected
from all US military personnel, oversampling female service
members, Reserve/National Guard service members,
and those who had been previously deployed to south- 
west Asia, Bosnia, or Kosovo from 1998 through 2000, to 
ensure sufficient power to detect differences in smaller 
subgroups of the population. The probability-based sam- 
ple, representing approximately 11.3 percent of the 2.2 
million men and women in service as of October 2000, 
was provided by the Defense Manpower Data Center 
(DMDC) in California. Of the 77,047 individuals who 
enrolled (36 percent response rate) from July 2001 to 
June 2003 in Panel 1, 55,021 (71 percent follow-up rate) 
completed the first follow-up questionnaire between June 
2004 and February 2006. In addition to Panel 1, the 
invited participants of Panel 2 were randomly selected 
from military personnel with 1 to 2 years of service as of 
October 2003, and 31,110 enrolled (25 percent response 
rate). Marines and women were oversampled in this
panel in order to ensure sufficient power among women 
as well as the most likely group of combat deployers. 
This investigation began with 108,157 consenting partici- 
pants who completed a questionnaire from either Panel 1 
(baseline and/or follow-up) or Panel 2 baseline. Investiga- 
tions of nonresponse to the first follow-up questionnaire 
found no appreciable bias as reflected by comparing mea- 
sures of association for selected outcomes using complete 
case and inverse probability weighting [7]. Participants 
with missing covariate data were removed from analyses. 
Demographic and military-specific data were obtained 
from electronic personnel files maintained by DMDC. 
Variables included sex, birth date, highest education 
level, marital status, race/ethnicity, past deployment to 
southwest Asia, Bosnia, or Kosovo between 1998 and 
2000, pay grade, service component (active duty and 
reserve/National Guard), service branch (Army, Navy, 
Coast Guard, Air Force, and Marine Corps), and 
occupations. 

The questionnaire consisted of 67 questions, including 
the open-ended question that read, "Do you have any con- 
cerns about your health that are not covered in this survey 
that you would like to share?" While other questions
allowed for free form text input, they were designed to 
accommodate only brief responses. The open-ended ques- 
tion was designed for participants to include as much 
information as they wanted, on any subject they wished
to discuss. The wide variation in response topics made
simplistic dictionary analysis of the open-ended responses
untenable. In addition, dictionary-based analyses are
unable to account for polysemy, a situation where one
word can have multiple meanings (e.g., "back" can refer to
back pain, backwards, or previous in time).



Latent Semantic Analysis 

LSA is a fully automatic mathematical/statistical technique 
for extracting and inferring meaningful relations from the 
contextual usage of words [8,9]. Using LSA software devel- 
oped by Pearson Knowledge Technologies, lexical analysis 
was performed on the responses to the final question, 
which asks participants to share any other health concerns 
not covered in the structured instrument. This allowed for 
identifying semantic similarities among open text 
responses to determine clusters of responses with high 
contextual similarity (e.g., noting that "welding fumes" and 
"asbestos" have similar meaning within the context of this 
study). LSA overcomes the limitations of simple dictionary-based
analysis because it determines meaning from
contextual similarity, rather than from human-defined synonyms
and related words.
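
The idea of inferring meaning from contextual usage can be illustrated with a minimal sketch. It is not the Pearson Knowledge Technologies software used in this study; the six-document corpus, the choice of k = 2 dimensions, and all function names are assumptions made for illustration only. The sketch builds a term-by-document count matrix, reduces it with a truncated singular value decomposition, and compares terms by the cosine between their reduced vectors.

```python
import numpy as np

# Toy corpus (hypothetical): each string stands in for one document.
docs = [
    "exposed to welding fumes at work",
    "exposed to asbestos at work",
    "exposed to welding fumes and asbestos",
    "knee pain after running",
    "back pain after running",
    "knee and back pain",
]

# Term-by-document count matrix.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        counts[index[w], j] += 1

# Truncated SVD: keep k latent dimensions (the study used 300; k = 2 here).
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
term_vectors = U[:, :k] * s[:k]  # one k-dimensional vector per term


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def term_similarity(w1, w2):
    """Cosine similarity between two vocabulary terms in the reduced space."""
    return cosine(term_vectors[index[w1]], term_vectors[index[w2]])


# Terms used in similar contexts end up close together in the reduced space.
sim_related = term_similarity("fumes", "asbestos")
sim_unrelated = term_similarity("fumes", "knee")
```

In this toy space, "fumes" and "asbestos" receive a much higher cosine than "fumes" and "knee", because the first pair shares surrounding words, which mirrors the "welding fumes" and "asbestos" example above.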

The first step in applying LSA to the analysis of open- 
ended responses was to create a semantic space, "a math- 
ematical representation of a large body of text[s]" [9], 
using a corpus of medical and military documents as well 
as the text of the questionnaire itself and the open-ended 
responses. The semantic space was generated from 
1,862,972 medical and military documents comprising 
435,456 unique terms. These documents included medi- 
cal journal articles containing health related writings, 
military documents replete with jargon and geographical 
locations, plus common English language works. In addi- 
tion, the open-ended responses were included in the 
semantic space in order to identify semantic similarities 
that would not exist outside the context of an open- 
ended response. To reduce complexity, the size of the 
semantic space was optimized by LSA to have n = 300 
dimensions. Data were then filtered by removing 
responses that conveyed no information about the health 
of the participant (e.g., "No," "N/A," "I have nothing to 
say"). This removed entire responses from the analysis, 
an important distinction from the common tactic of 
employing a "stop list", which removes common words 
(e.g., "and", "the", etc.) from specific responses. In this 
analysis, every word in every response was considered for 
analysis; only the responses determined to convey no 
meaning were removed. Once identified, those individuals
with meaningless responses (n = 33,591) were
included in the group of participants who did not
respond to the open-ended question. Upon human examination,
25 responses (0.1 percent) originally classified as
meaningless were reclassified as meaningful. To investigate
the number of responses misclassified as meaningful,
a random sample of 250 responses originally classified as
meaningful was reviewed by humans. Of these, only 5
(2.0 percent) were judged to be actually meaningless. Therefore,
the classification method was biased slightly toward categorizing
responses as meaningful rather than the opposite.
This small amount of misclassification is
expected to have minimal effects on our study findings.
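
The difference between removing whole responses and removing stop words can be sketched as follows. The null patterns shown are hypothetical stand-ins (the patterns actually used in the study are not listed in this paper), and the normalization rule is an assumption. A response is dropped only if its entire normalized text matches a null pattern; no individual words are removed from responses that are kept.

```python
import re

# Hypothetical null patterns; the study's own patterns are not listed here.
NULL_PATTERNS = {"no", "none", "n a", "nothing", "i have nothing to say"}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and trim whitespace."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return cleaned.strip()


def is_meaningful(response: str) -> bool:
    """Return False for empty responses or responses matching a null pattern."""
    norm = normalize(response)
    return bool(norm) and norm not in NULL_PATTERNS


responses = ["No", "N/A", "I have nothing to say.", "Exposure to welding fumes."]
meaningful = [r for r in responses if is_meaningful(r)]
```

Here only the last response is retained, and it is retained in full, including common words such as "to".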

A set of 1,025 clustering terms was created by selecting
words from the meaningful responses that each
appeared more than 70 times (excluding words in a
high-frequency stop list; a stop list was not used in the
creation of the semantic space). LSA was used to derive
a dissimilarity measure from the cosine
between each pair of terms in the set, producing a distance
matrix. The set of terms was partitioned into 20
non-overlapping clusters using partitioning around medoids,
a variant of the k-means clustering algorithm implemented
as the pam function in the cluster package for the
R language. Twenty clusters were chosen since more than
20 clusters gave redundant or overlapping clusters, or 
clusters that were not relevant to the medical domain 
(e.g. measures of time, military terms). Fewer than 20 
clusters did not provide sufficient separation into sepa- 
rate categories. Each cluster was represented by its 
medoid, the term most central in the cluster. Meaning- 
ful responses were assigned to clusters by computing 
the similarity between each response and each cluster 
medoid. If the cosine between a response and a medoid
(a measure of how closely the response vector aligns with
the cluster medoid vector) was greater than 0.2,
the response was assigned to that cluster. The clusters 
were then ranked based on how many responses they 
contained. The 20 clusters that accounted for the most 
responses were examined to determine their semantic 
meaning. However, not all of the top-20 clusters had 
discernable semantic meaning; some clusters appeared 
to be an artifact of the LSA technology (e.g., the cluster 
described by the following terms: a lot, don't, haven't, 
isn't, believed). For this exploratory analysis, the clusters 
without obvious semantic meaning were not included 
due to the difficulty determining the topic of concern. 
Responses could be assigned to multiple clusters, though 
this occurred infrequently. This analysis resulted in 
24,181 (86.6 percent) of the 27,916 meaningful 
responses being assigned to at least one area of concern 
(represented by membership in a cluster). 
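
The clustering and assignment steps can be sketched under stated assumptions. The study used the pam function in R; the routine below is a simplified alternating k-medoids procedure with a deterministic initialization, not pam's own search, and the two-dimensional term and response vectors are invented for illustration. Only the 0.2 cosine threshold is taken from the text.

```python
import numpy as np


def cosine_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


def k_medoids(dist, k, n_iter=100):
    """Simplified alternating k-medoids on a precomputed distance matrix."""
    # Start from the most central point, then add the farthest remaining points.
    medoids = [int(np.argmin(dist.sum(axis=1)))]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(np.argmax(nearest)))
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        updated = []
        for c, old in enumerate(medoids):
            members = np.flatnonzero(labels == c)
            if members.size == 0:  # keep the old medoid for an empty cluster
                updated.append(old)
                continue
            # New medoid: the member with the smallest total distance to the rest.
            within = dist[np.ix_(members, members)].sum(axis=1)
            updated.append(int(members[np.argmin(within)]))
        if updated == medoids:
            break
        medoids = updated
    return medoids, np.argmin(dist[:, medoids], axis=1)


def assign_responses(response_vecs, medoid_vecs, threshold=0.2):
    """List, for each response, every cluster whose medoid cosine exceeds the threshold."""
    r = response_vecs / np.linalg.norm(response_vecs, axis=1, keepdims=True)
    m = medoid_vecs / np.linalg.norm(medoid_vecs, axis=1, keepdims=True)
    sims = r @ m.T
    return [[int(c) for c in np.flatnonzero(row > threshold)] for row in sims]


# Hypothetical 2-dimensional term vectors (the study used 300 dimensions).
terms = ["chemicals", "radiation", "asbestos", "walking", "biking", "vigorous"]
term_vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
                      [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]])
dist = 1.0 - cosine_matrix(term_vecs)  # cosine similarity converted to a distance
medoids, labels = k_medoids(dist, k=2)

# Hypothetical response vectors: one per open-ended response.
response_vecs = np.array([[1.0, 0.0], [0.7, 0.7], [-1.0, -0.1]])
assignments = assign_responses(response_vecs, term_vecs[medoids])
```

In this toy example the first response falls in one cluster, the second exceeds the threshold for both clusters, and the third is assigned to none, which corresponds to the three situations described above: single membership, multiple membership, and responses left unassigned.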

Statistical analysis 

Descriptive and quantitative analyses of demographic 
characteristics among those who did and did not 
respond to the open-ended question were performed. 
Multivariable logistic regression modeling was used to 
investigate associations between demographic characteristics
and whether participants responded to the open-ended
text question. A separate logistic regression model was 
run for Panel 1 baseline, Panel 1 follow-up, and Panel 2 
baseline populations. All statistical data analyses were 
performed using SAS statistical software version 9.2 
(SAS Institute Inc., Cary, NC). 
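
How adjusted odds ratios and their 95% confidence intervals follow from a multivariable logistic model can be sketched as below. The study's models were fit in SAS; this Python version, the Newton-Raphson fitting routine, the two covariates, and the simulated data are all assumptions for illustration and do not reproduce any study result.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson.

    Returns the coefficients (intercept first) and their standard errors.
    """
    design = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        weights = p * (1.0 - p)
        hessian = design.T @ (design * weights[:, None])
        beta = beta + np.linalg.solve(hessian, design.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-design @ beta))
    cov = np.linalg.inv(design.T @ (design * (p * (1.0 - p))[:, None]))
    return beta, np.sqrt(np.diag(cov))


# Simulated data (hypothetical): two binary covariates and a binary outcome.
rng = np.random.default_rng(0)
n = 5000
fair_poor = rng.integers(0, 2, n)    # 1 = fair/poor self-reported health
active_duty = rng.integers(0, 2, n)  # 1 = active duty
log_odds = -1.5 + np.log(2.5) * fair_poor + np.log(1.3) * active_duty
responded = (rng.random(n) < 1.0 / (1.0 + np.exp(-log_odds))).astype(float)

X = np.column_stack([fair_poor, active_duty])
beta, se = fit_logistic(X, responded)

# Adjusted odds ratios and 95% confidence intervals (intercept excluded).
aor = np.exp(beta[1:])
ci_low = np.exp(beta[1:] - 1.96 * se[1:])
ci_high = np.exp(beta[1:] + 1.96 * se[1:])
```

Each adjusted odds ratio is the exponentiated coefficient of one covariate with the other covariates held fixed, and an interval that excludes 1.00 corresponds to significance at the 0.05 level.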



Results 

The semantic space was generated from 1,862,972 medical 
and military documents comprising 435,456 unique terms 
using 300 dimensions. Of the 108,157 eligible participants, 
19 were removed due to missing information for educa- 
tion and marital status, leaving 108,138 participants for 
analyses. Of the 108,138 participants in the study who 
completed 163,159 surveys from 2001-2006 (encompass- 
ing Panel 1 baseline and follow-up, and Panel 2 baseline), 
61,507 surveys (37.7 percent) had a response in the open- 
ended field. There were 670 unique null patterns (indicat- 
ing a meaningless response) identified, resulting in 33,591 
of the open-ended responses (54.6 percent) being classified 
as having a meaningless response. Subsequently, 27,916 
(45.4 percent of open-ended responses, 17.1 percent of all 
completed surveys) were classified with meaningful 
responses. 
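
The counts and percentages in this paragraph are mutually consistent, as the following short check shows. The variable names are illustrative; the numbers are those reported above.

```python
eligible = 108_157
removed_missing_covariates = 19
participants = eligible - removed_missing_covariates

surveys_completed = 163_159
surveys_with_text = 61_507
meaningful = 27_916
meaningless = surveys_with_text - meaningful


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


share_with_text = pct(surveys_with_text, surveys_completed)
share_meaningless = pct(meaningless, surveys_with_text)
share_meaningful = pct(meaningful, surveys_with_text)
share_meaningful_of_all = pct(meaningful, surveys_completed)
```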

Table 1 describes characteristics of Millennium Cohort 
Study participants who responded to the open-ended 
question, stratified by panel and survey. Open-ended 
responders were generally representative of their overall 
panel characteristics. However, for all three groups, a 
higher proportion of open-ended responders were older, 
on active duty, Army members, and combat specialists. 
Education level did not have a significant effect on
response to the open-ended question. In addition, open-ended
responders were more likely to self-report good,
fair, or poor general health compared with those who did
not provide an open-ended response, who were more
likely to report very good or excellent health.

The adjusted odds of response to the open-ended ques- 
tion for each of the respective response groups are dis- 
played in Table 2. Increased adjusted odds of response to 
the open-ended question were found in personnel with 
service in the Army, Navy/Coast Guard, and the Marine 
Corps in comparison with Air Force members. Cohort 
members who were older, serving on active duty and in 
combat specialties were significantly more likely to 
respond to the open-ended question across all panels. 
Black non-Hispanic participants were significantly less 
likely to respond than white non-Hispanic participants. 
Among all panels, those who indicated fair or poor health 
were nearly three times more likely to respond when com- 
pared with those reporting very good or excellent health. 
Panel 1 women were more likely than men to provide a 
meaningful open-ended response, while no sex difference 
was observed among Panel 2 participants. Panel 1 baseline 
participants with deployment experience between 2001 
and 2007 in support of the operations in Iraq and Afghani- 
stan were less likely to respond to the open-ended ques- 
tion. However, Panel 1 follow-up and Panel 2 baseline 
participants with deployment experience in support of the 
operations in Iraq and Afghanistan were more likely to 
respond to the open-ended question. 



Table 1 Characteristics of Millennium Cohort Study Participants Who Provided a Meaningful Response for the Open-Ended Question

                            Panel 1 Baseline              Panel 1 Follow-up             Panel 2 Baseline
                            All            Open-text      All            Open-text      All            Open-text
                            responders     responders(a)  responders     responders(a)  responders     responders(a)
                            n = 77,042     n = 14,692     n = 55,021     n = 8,937      n = 31,096     n = 4,287
Characteristic              %(b)           %(b)           %(b)           %(b)           %(b)           %(b)

Sex
  Male                      73.2           73.2           73.3           72.8           61.6           63.4
  Female                    26.8           26.8           26.7           27.2           38.4           36.6
Birth year
  Before 1960               21.6           24.2           24.5           28.1           0.7            0.9
  1960-1969                 37.9           39.2           40.5           40.5           5.4            6.3
  1970-1979                 34.6           31.8           30.8           28.1           31.9           35.6
  1980 or later             5.9            4.7            4.2            3.3            62.0           57.2
Education
  High school or less       48.9           48.9           45.6           43.3           81.4           80.6
  Some college              25.5           24.3           17.8           18.4           3.2            4.0
  Bachelor's degree         16.5           16.7           22.1           22.7           12.3           12.7
  Advanced degree           9.1            10.1           14.5           15.6           3.1            2.7
Marital status
  Married                   63.1           64.1           73.3           72.8           28.1           29.0
  Not married               36.9           35.9           26.7           27.2           71.9           71.0
Race/ethnicity
  White non-Hispanic        69.6           70.6           70.8           69.9           71.2           72.2
  Black non-Hispanic        13.8           11.6           12.2           11.0           11.6           10.2
  Other                     16.7           17.8           16.9           19.1           17.1           17.6
2001-2007 deployment(c)
  No                        57.6           61.4           56.3           56.1           42.3           38.4
  Yes                       42.5           38.6           43.6           43.9           57.7           61.6
Military rank
  Enlisted                  77.0           75.7           70.8           69.4           88.4           89.4
  Officer                   23.0           24.3           29.2           30.6           11.6           10.6
Service component
  Reserve/Guard             43.0           36.8           53.4           51.3           40.0           36.7
  Active duty               57.0           63.2           46.6           48.7           60.0           63.3
Branch of service
  Air Force                 29.0           25.7           30.3           24.7           26.6           17.9
  Army                      47.4           48.1           47.7           52.1           48.2           55.0
  Navy/Coast Guard          18.5           20.6           18.1           18.8           16.9           16.8
  Marine Corps              5.1            5.6            4.0            4.3            8.3            10.3
Occupational category
  Others                    69.9           69.0           69.4           68.3           72.5           72.0
  Combat specialists        20.0           21.2           19.2           20.3           15.7           19.1
  Health care specialists   10.1           9.8            11.4           11.4           11.8           8.9
General health(d)
  Very good/excellent       59.0           48.9           55.3           45.6           54.3           41.4
  Good                      30.3           35.1           34.5           37.4           33.1           37.8
  Fair/poor                 7.7            13.3           8.9            15.9           8.9            17.7
  Missing                   3.0            2.7            1.3            1.0            3.6            3.1

(a) Includes participants who had a meaningful response to the open-ended question, "Do you have any concerns that are not covered in this survey that you would like to share?"
(b) Percentages were rounded and may not sum to 100.
(c) Any deployment in support of the wars in Iraq and Afghanistan September 2001-October 2007.
(d) Self-reported general health from the question, "In general, would you say your health is excellent, very good, good, fair, or poor?"
Table 2 Adjusted Odds of Response to the Open-Ended Question by Characteristics of Millennium Cohort Study Participants

                            Adjusted Odds of Response to Open-Ended Question(a)
                            Panel 1 Baseline        Panel 1 Follow-up       Panel 2 Baseline
                            n = 74,664              n = 54,250              n = 29,902
Characteristic              AOR     95% CI          AOR     95% CI          AOR     95% CI

Sex
  Male                      ref                     ref                     ref
  Female                    1.07*   1.02, 1.12      1.09*   1.03, …         1.00    0.92, 1.07
Birth year
  Before 1960               1.00                    1.00                    1.00
  1960-1969                 0.83*   0.79, 0.87      0.81*   0.76, 0.86      0.78    0.53, 1.15
  1970-1979                 0.65*   0.61, 0.96      0.71*   0.67, 0.76      0.64*   0.44, 0.93
  1980 or later             0.52*   0.47, 0.58      0.57*   0.50, 0.65      0.49*   0.34, 0.71
Education
  High school or less       ref                     ref                     ref
  Some college              1.03    0.98, 1.09      1.09*   1.02, 1.16      1.33*   1.11, 1.59
  Bachelor's degree         1.07    0.99, 1.15      1.13*   1.05, 1.22      1.17*   1.00, 1.37
  Advanced degree           1.07    0.97, 1.18      1.17*   1.06, 1.29      1.15    0.88, 1.50
Marital status
  Married                   ref                     ref                     ref
  Not married               1.09*   1.04, 1.14      1.06*   1.01, 1.12      1.06    0.98, 1.14
Race/ethnicity
  White non-Hispanic        ref                     ref                     ref
  Black non-Hispanic        0.71*   0.67, 0.75      0.82*   0.76, 0.88      0.80*   0.72, …
  Other                     0.95*   0.90, 1.00      1.07*   1.00, 1.14      0.99    0.90, …
2001-2007 deployment(b)
  No                        ref                     ref                     ref
  Yes                       …       0.84, 0.91      1.13*   …               1.10*   1.02, 1.18
Military rank
  Enlisted                  ref                     ref                     ref
  Officer                   1.07    0.99, 1.15      1.05    0.97, 1.14      1.06    0.88, 1.27
Service component
  Reserve/Guard             ref                     ref                     ref
  Active duty               1.50*   1.44, 1.57      1.14*   1.09, 1.20      1.32*   1.22, 1.43
Branch of service
  Air Force                 ref                     ref                     ref
  Army                      1.30*   1.24, 1.38      1.43*   1.35, 1.52      1.72*   1.57, 1.88
  Navy/Coast Guard          1.26*   1.18, 1.34      1.35*   1.26, 1.45      1.39*   1.24, 1.55
  Marine Corps              1.42*   1.30, 1.56      1.56*   1.38, 1.76      1.82*   1.59, 2.08
Occupational category
  Others                    ref                     ref                     ref
  Health care specialists   0.90*   0.84, 0.96      1.00    0.93, 1.08      0.76*   0.67, 0.86
  Combat specialists        1.07*   1.02, 1.13      …       1.02, 1.15      1.18*   1.07, 1.29
General health(c)
  Very good/excellent       ref                     ref                     ref
  Good                      1.55*   1.49, 1.61      1.47*   1.39, 1.54      1.60*   1.48, 1.72
  Fair/poor                 2.66*   2.50, 2.84      2.79*   2.59, 3.00      3.08*   2.79, 3.41

* Indicates statistical significance at the α = 0.05 level, with a 95% confidence interval that excluded 1.00.
(a) Includes participants who had a meaningful response to the open-ended question, "Do you have any concerns that are not covered in this survey that you would like to share?" A separate logistic regression model was run for Panel 1 baseline, Panel 1 follow-up, and Panel 2 baseline populations.
(b) Any deployment in support of the wars in Iraq and Afghanistan September 2001-October 2007.
(c) Self-reported general health from the question, "In general, would you say your health is excellent, very good, good, fair, or poor?"

Table 3 shows some example responses, as well as
their associated clusters. Each row represents one clus- 
ter, with an example participant response displayed. 
Although the illness/injury cluster includes both chronic 
and acute concerns, blood pressure medication was the 
most commonly expressed issue. Exposure concerns 
were mostly either workplace hazards (e.g. toxic chemi- 
cals) or deployment concerns (e.g., being around strange 
chemicals during deployment). The responses classified 
in the exercise cluster mainly focused on fitness, 
although some responses overlapped between exercise 
and injury. Mental health included a wide range of 
responses, from childhood abuse to concerns about 
postdeployment readjustment. Although not readily 
apparent using human analysis, anxiety was identified as 
a separate cluster from mental health using LSA. Vacci- 
nation concerns were frequently expressed, even though 
the structured questionnaire contained a few vaccine 
questions. 

The most frequently expressed areas of concern are 
shown in Table 4. Responders to the open-ended question 
most frequently expressed a concern with an illness or 
injury (28.0 percent). Terms present in the response that 
represented illness or injury concerns included words such 
as "suffered," "recovered," and "developed." Some of the 
other more frequently expressed areas of concern were 
exposure, discussed in 13.6 percent of open-ended 
responses and indicated by words such as "chemicals," 
"radiation," and "asbestos"; and exercise, discussed in 11.0 
percent of open-ended responses, represented by terms 
such as "walking," "biking," and "vigorous". Other com- 
mon concerns were back pain (8.8 percent), deployment 
(7.6 percent), arm symptoms (7.4 percent), mental health 
(7.2 percent), weight (6.3 percent), vaccination (4.5 percent),
anxiety/disorientation (3.5 percent), and surgery (2.1
percent). Panel 1 open-ended responders more frequently 
expressed concerns about deployment at follow-up (8.3
percent) compared with baseline (7.1 percent). Compared
with the total study population, a greater proportion of 
Panel 1 follow-up and Panel 2 baseline responders, who 
both filled out their respective survey from 2004-2006, 
indicated concerns about deployment and mental health. 

Discussion 

As computing capabilities grow, researchers are increasingly
given opportunities to use complex and computationally
intensive analytic techniques to answer scientific
questions. For researchers confronted with the practical challenges of
analyzing open-text responses, LSA offers a comprehensive
method for efficient and standardized analysis of these
data. In this exploratory analysis, we found subgroups of
the population that were more likely to use the open-text 
response option. Of greatest interest are those who 
reported poor general health and their propensity to use 
the open-text field. Since these individuals may be of high 
concern in health research, this text field yields additional 
valuable insight not otherwise assessed. 

Limited research exists on the characteristics of indivi- 
duals who choose to provide additional information as 
part of an optional open-ended text field on a survey. The 
strongest association observed in this study was that parti- 
cipants with poorer self-reported general health were sig- 
nificantly more likely to respond within the open-ended 
text field, and the likelihood of response increased as self- 
reported health status decreased. Interestingly, in the 
entire Millennium Cohort, it has been shown that there is 
not a significant association between health status and 
likelihood of enrollment [10]. However, it is important to 
note that all of the individuals in this current study were 
already participants in the Millennium Cohort Study; 
therefore, even though they may not have enrolled based 
on their health status, perhaps health status motivated 
them to provide additional information in the open-ended 
field. Those with poor self-perceived general health may 



Table 3 Example Responses From Millennium Cohort Study Participants Within the Top Seven Concerns Expressed in the Open-Ended Question

Area of Concern(a)        Example Response

Illness/injury(b)         I recently had my blood pressure medication dose increased to control hypertension.
Illness/injury            I was involved in a motor vehicle collision. It has caused delays in my return to reserve duty/flight duty. I suffered a head injury/laceration and orthopedic injury/laceration to left knee.
Exposure                  Exposed to hepatitis, asbestos, and enriched uranium in Uzbekistan and Afghanistan.
                          Exposure to welding fumes.
Exercise                  Lower back, knee, and ankle pain due to extended periods of massive weight-bearing duties and exercise.
Mental health             Mental and emotional problems due to sexual child abuse.
Anxiety/disorientation    Extreme stress and anxiety due to superiors' incompetence.
Vaccination               Allergic reactions to anthrax vaccine.

(a) A single participant response could be categorized into multiple areas of concern.
(b) The cluster labeled "illness/injury" describes responses across a broad number of concerns. Several examples are provided in Table 4 to better illustrate these topic areas within the cluster.



Table 4 Most Frequently Expressed Areas of Concern Among Millennium Cohort Study Participants Responding to the Open-Ended Text Question

Area of Concern(a)            Total           P1 Baseline     P1 Follow-up    P2 Baseline     Related Terms(b)
                                              Responses       Responses       Responses
                              n = 10,214      n = 5,626       n = 3,297       n = 1,291
                              n       %       n       %       n       %       n       %

Illness/injury                2,859   28.0    1,433   25.5    1,033   31.3    393     30.4    suffered, recovered, developed
Exposure                      1,385   13.6    887     15.8    328     10.0    170     13.2    chemicals, radiation, asbestos
Exercise                      1,125   11.0    613     10.9    391     11.9    121     9.4     walking, biking, vigorous
Back pain                     903     8.8     482     8.6     313     9.5     108     8.4     discs, herniation, lumbar
Deployment regions/concerns   780     7.6     399     7.1     275     8.3     106     8.2     Bosnia, barracks, DU
Arm                           754     7.4     465     8.3     225     6.8     64      5.0     elbow, pronate, grip
Mental health                 735     7.2     385     6.8     240     7.3     110     8.5     emotional, interpersonal, anxiety
Weight concerns               647     6.3     354     6.3     206     6.3     87      6.7     lose, dieting, obesity
Vaccination                   459     4.5     299     5.3     114     3.5     46      3.6     VAERS, influenza, boosters
Anxiety/disorientation        355     3.5     185     3.3     117     3.6     53      4.1     shortness, sweating, tiredness
Surgery                       212     2.1     124     2.2     55      1.7     33      2.6     removed, tape, wrapped

(a) Participants were able to provide a response in more than one area at multiple time points.
(b) Example terms included in the same cluster that is described by the Area of Concern.

be more likely to report symptoms [11], or perhaps they
have a desire to explain their poor health in greater detail 
than do healthier individuals. Regardless of why indivi- 
duals with poorer self-reported general health are more 
likely to respond to the open-ended question, this finding 
should be considered when conducting future analyses of 
response bias in the Millennium Cohort. 

With nearly 1 in 5 respondents choosing to include 
information in the open text field, it is important to know 
their characteristics. Adjusted data interestingly suggest 
some weak patterns, albeit significant, in response to the 
open text field differentiated by sex, age, active-duty status, 
and combat occupations. Air Force personnel were least 
likely to include a meaningful response to the question, 
but were also most likely to respond and respond early to 
the initial invitation for enrollment [6,12]. Combat specialists
and Marine Corps members were also more likely to
respond to the open text question, which may be attributable
to the ongoing combat operations in Iraq and Afghanistan.
Other findings on education status indicate that
response rates generally increase as education level
increases; this does not hold true for the open-ended
response. This lack of effect could be attributed to the free-form
nature of the open-ended text field; reading comprehension
of the participant may be less of an issue when
compared with the structured instrument.

Another interesting finding is that illness/injury was by 
far the most frequently expressed area of concern. This 
may suggest that physical or emotional ailments cause 
concern for people, either about how or why illness or
injury occurred, or how these ailments may affect their 
short- or long-term quality of life. It is also worth noting 
that a higher proportion of individuals reported concerns 
regarding either illness/injury or deployment on the 
2004-2006 assessment compared with the 2001-2003 
assessment. This may be a reflection of the increased 
deployments to Iraq and Afghanistan as the conflicts 
continued to heighten over this time period. With only 
one follow-up data point available for the present study, 
it was difficult to fully understand this relationship; how- 
ever, it will be interesting to examine whether these con- 
cerns persist at the same or increased levels in the 2007- 
2008 and future assessments. 

The Millennium Cohort Study team re-examines the 
structured survey instrument between survey cycles, fre- 
quently adding questions that were not originally 
included in the previous instrument. Based in part on the 
open-ended text analysis described in this paper, several 
changes have been made: in 2004, physical activity ques- 
tions were added to the survey; in 2007 questions were 
added that focused on physical injury and deployment- 
specific exposures; in 2010, the physical injury section 
was supplemented, and questions on sleep length and 
quality were included. There was a very small proportion
of responses related to very specific chemical exposures
or other topics that were outside the scope of the survey, 
or very specific to a few individuals. The open-ended
question provides a channel for participants to raise awareness
of newly identified, cutting-edge topics that can help
inform survey designers.

There are some limitations to these analyses that 
should be mentioned. The study population consisted of 
a sample of responders to the Millennium Cohort ques- 
tionnaire and may not be representative of the military 
population. However, investigations of potential biases in 
the Millennium Cohort have found the cohort to be 
representative of the military population, to report reliable 
data, and not to have been influenced to participate by poor 
health prior to enrollment [6,10,13-20]. Latent Semantic 
Analysis is a technique to transform qualitative data into 
quantitative information, but it has limitations, including 
situations where meaning is determined contextually. 
Additionally, it is possible that nonobvious underlying 
relationships existed within the top-20 automatically 
generated clusters, which could reveal more concerns that we 
were unable to detect. While these clusters were not included 
in the attached tables, they were included in the demo- 
graphic analysis. The greatest limitation to using LSA on 
open-ended text responses, however, is the vagueness in 
grouping certain responses together. LSA approximates 
semantic meaning (related concerns) by using mathema- 
tical transformations as a proxy; not all mathematically 
related responses were obviously similar. This made it 
more difficult to cleanly distinguish between different 
clusters when performing the final analysis. 
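The idea that LSA uses a mathematical transformation as a proxy for semantic relatedness can be illustrated with a minimal sketch. This is not the software or pipeline used in this study: the seven short "responses" below are invented for illustration, binary word counts stand in for the term weighting a production LSA system would apply, and a simple power iteration stands in for a library singular value decomposition routine so that the example needs only the Python standard library. All function names are hypothetical.

```python
import math

def tokenize(text):
    """Split a response into lowercase word tokens."""
    return text.lower().split()

def gram_matrix(docs):
    """Response-by-response matrix of shared-word counts.

    This equals A^T A for a binary term-by-response matrix A, whose
    eigenvectors are the right singular vectors of A.
    """
    bags = [set(tokenize(d)) for d in docs]
    return [[float(len(a & b)) for b in bags] for a in bags]

def top_eigenpairs(G, k, iters=1000):
    """Top-k eigenpairs of a symmetric matrix by power iteration,
    re-orthogonalising against the vectors already found."""
    n = len(G)
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.01 * i for i in range(n)]  # deterministic start
        for _ in range(iters):
            for _, u in pairs:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(G[i][j] * v[j] for j in range(n))
                  for i in range(n))
        pairs.append((lam, v))
    return pairs

def lsa_coordinates(docs, k):
    """Coordinates of each response in a k-dimensional latent space
    (singular value times right singular vector component)."""
    pairs = top_eigenpairs(gram_matrix(docs), k)
    return [[math.sqrt(lam) * vec[j] for lam, vec in pairs]
            for j in range(len(docs))]

def cosine(u, v):
    """Cosine similarity between two latent-space vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Invented toy responses, not study data.
docs = [
    "knee injury pain",     # 0
    "back injury pain",     # 1
    "knee surgery",         # 2
    "back pain",            # 3
    "smoke exposure burn",  # 4
    "burn pit exposure",    # 5
    "smoke fumes",          # 6
]
coords = lsa_coordinates(docs, k=2)

# Responses 2 and 3 share no words but co-occur with the same terms.
same_topic = cosine(coords[2], coords[3])
# Responses 2 and 6 belong to unrelated vocabularies.
cross_topic = cosine(coords[2], coords[6])
```

In this deliberately clean toy corpus, `same_topic` comes out close to 1 and `cross_topic` close to 0, even though "knee surgery" and "back pain" have no word in common. Real open-ended responses are far noisier, so mathematically close responses are not always obviously similar, which is the limitation described above.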

Despite these limitations, there are important strengths 
of this analysis. To our knowledge, this study is one of the 
first to apply LSA-based analyses to open-ended epidemio- 
logic survey responses from a large US military population. 
This is also one of the first studies to examine the open- 
ended text responses from US military personnel, includ- 
ing reserve/National Guard, and members who have left 
military service. Previous analyses of military populations 
used human-assisted computer analysis but generally relied on 
less sophisticated methodologies [21]. Once the initial 
semantic space is created, LSA is fully automatic, permit- 
ting rapid analysis of large sets of responses. Because 
knowledge of word meaning is not derived from thesauri, 
ontologies, or hand-coding of relationships among words 
or among responses, bias from human coders and inter- 
pretation error is minimized. LSA can evaluate a word 
whose meaning is determined contextually (e.g., "we 
moved back," is differentiated from "hurt my back"). 
Furthermore, it can determine similarity among responses 
without relying on word order, even when passages 
share no words in common [22]. We also examined the 
reliability of LSA versus human expert review of a random 
sample of 50 open-ended responses using the kappa 
coefficient [23], and found agreement between LSA and 
human review to be substantial to almost perfect for four 
out of five categories examined, bolstering confidence in 
the LSA technology. 
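For readers unfamiliar with the kappa coefficient, the following minimal sketch shows how chance-corrected agreement between two raters can be computed. It assumes Cohen's two-rater form of kappa, which the text does not specify, and the labels are invented for illustration rather than taken from the 50-response reliability sample. The function name is hypothetical.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Invented labels: 1 = response assigned to a category, 0 = not assigned.
lsa_labels   = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
human_labels = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
kappa = cohen_kappa(lsa_labels, human_labels)
```

Here the raters agree on 9 of 10 items, chance agreement is 0.5, and kappa is therefore 0.8. Under the widely used Landis and Koch benchmarks, values of 0.61 to 0.80 are described as substantial agreement and 0.81 to 1.00 as almost perfect, which is presumably the sense in which those terms are used above.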

Conclusion 

Future directions of this work may include application of 
analyses to better define concerns within the Cohort. 
Comparisons between the structured response and open- 
ended sections could be used to evaluate the comprehen- 
sion of the structured instrument. Open-ended text can 
reveal additional issues of prominent importance to parti- 
cipants. Investigators are continually challenged with 
addressing symptom-based illness that may not be well- 
defined under previous disease paradigms, and open- 
ended responses among large populations are critical to 
understanding such complex syndromes [24]. In addition, 
as society increasingly prefers brief, text-based communi- 
cation for many health issues, analyses of written messages 
among populations may reveal important public health 
trends [25]. Computerized text-parsing tools such as LSA 
allow an objective review of text responses that would be 
otherwise impossible to standardize. LSA may be used to 
define health concerns with related context, and identify 
whether they represent large-scale concerns of a few indi- 
viduals or common concerns of a great many individuals. 
Results will continue to help drive directions of future 
research and survey content. Review of open-ended text 
with text-mining tools such as LSA is critical to allow par- 
ticipant voices to truly be heard, from within the bounds 
of large-scale epidemiologic survey studies. 